 piano roll


Learning and composing of classical music using restricted Boltzmann machines

Kobayashi, Mutsumi, Watanabe, Hiroshi

arXiv.org Artificial Intelligence

We investigate how machine learning models acquire the ability to compose music and how musical information is internally represented within such models. We develop a composition algorithm based on a restricted Boltzmann machine (RBM), a simple generative model capable of producing musical pieces of arbitrary length. We convert musical scores into piano-roll image representations and train the RBM in an unsupervised manner. We confirm that the trained RBM can generate new musical pieces; however, by analyzing the model's responses and internal structure, we find that the learned information is not stored in a form directly interpretable by humans. This study contributes to a better understanding of how machine learning models capable of music composition may internally represent musical structure and highlights issues related to the interpretability of generative models in creative tasks.
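The piano-roll representation mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical note format of (MIDI pitch, start step, end step) on a fixed time grid; the paper's actual preprocessing, grid resolution, and pitch range are not given in the abstract, so none of this should be read as the authors' pipeline.

```python
from typing import List, Tuple

# Hypothetical note format (an assumption, not from the paper):
# (midi_pitch, start_step, end_step), with end_step exclusive.
Note = Tuple[int, int, int]


def notes_to_piano_roll(
    notes: List[Note], n_steps: int, n_pitches: int = 128
) -> List[List[int]]:
    """Return a binary matrix roll[pitch][step], 1 where the pitch sounds."""
    roll = [[0] * n_steps for _ in range(n_pitches)]
    for pitch, start, end in notes:
        if not 0 <= pitch < n_pitches:
            raise ValueError(f"pitch {pitch} outside 0..{n_pitches - 1}")
        # Clip the note to the grid so out-of-range steps are ignored.
        for step in range(max(start, 0), min(end, n_steps)):
            roll[pitch][step] = 1
    return roll


# A C major triad held for two steps, followed by a single E for one step.
example_notes = [(60, 0, 2), (64, 0, 2), (67, 0, 2), (64, 2, 3)]
roll = notes_to_piano_roll(example_notes, n_steps=4)
```

Flattening such a matrix yields the kind of binary vector that a model with binary visible units could consume, although how the authors window or flatten their images is not stated here.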


Fine-Tuning MIDI-to-Audio Alignment using a Neural Network on Piano Roll and CQT Representations

Murgul, Sebastian, Reiser, Moritz, Heizmann, Michael, Seibert, Christoph

arXiv.org Artificial Intelligence

In this paper, we present a neural network approach for synchronizing audio recordings of human piano performances with their corresponding loosely aligned MIDI files. The task is addressed using a Convolutional Recurrent Neural Network (CRNN) architecture, which effectively captures spectral and temporal features by processing an unaligned piano roll and a spectrogram as inputs to estimate the aligned piano roll. To train the network, we create a dataset of piano pieces with augmented MIDI files that simulate common human timing errors. The proposed model achieves up to 20% higher alignment accuracy than the industry-standard Dynamic Time Warping (DTW) method across various tolerance windows. Furthermore, integrating DTW with the CRNN yields additional improvements, offering enhanced robustness and consistency. These findings demonstrate the potential of neural networks in advancing state-of-the-art MIDI-to-audio alignment.
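For readers unfamiliar with the DTW baseline named above, the following is a minimal sketch of classic dynamic time warping on two one-dimensional sequences (for example, note onset times). It is a textbook formulation with an absolute-difference local cost, chosen here for illustration; the feature representation and DTW variant used in the paper are not described in the abstract.

```python
from typing import List, Tuple


def dtw(a: List[float], b: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the total alignment cost and the warping path between a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    # Backtrack from the end, always moving to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        predecessors = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(predecessors)
    path.reverse()
    return cost[n][m], path


# The second sequence repeats the middle value, so one element of `a`
# is matched to two elements of `b`.
total, path = dtw([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

Each pair in `path` maps an index in the first sequence to an index in the second, which is the information an alignment method needs to shift loosely aligned MIDI events onto the audio timeline.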


MIDI-GPT: A Controllable Generative Model for Computer-Assisted Multitrack Music Composition

Pasquier, Philippe, Ens, Jeff, Fradet, Nathan, Triana, Paul, Rizzotti, Davide, Rolland, Jean-Baptiste, Safi, Maryam

arXiv.org Artificial Intelligence

We present and release MIDI-GPT, a generative system based on the Transformer architecture that is designed for computer-assisted music composition workflows. MIDI-GPT supports the infilling of musical material at the track and bar level, and can condition generation on attributes including: instrument type, musical style, note density, polyphony level, and note duration. In order to integrate these features, we employ an alternative representation for musical material, creating a time-ordered sequence of musical events for each track and concatenating several tracks into a single sequence, rather than using a single time-ordered sequence where the musical events corresponding to different tracks are interleaved. We also propose a variation of our representation allowing for expressiveness. We present experimental results that demonstrate that MIDI-GPT is able to consistently avoid duplicating the musical material it was trained on, generate music that is stylistically similar to the training dataset, and that attribute controls allow enforcing various constraints on the generated material. We also outline several real-world applications of MIDI-GPT, including collaborations with industry partners that explore the integration and evaluation of MIDI-GPT into commercial products, as well as several artistic works produced using it.
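The difference between the two sequence layouts described above can be sketched as follows. The event tuple format and every token name here (`TRACK_START_0`, `ON0_P60`, and so on) are hypothetical placeholders for illustration; they are not MIDI-GPT's actual vocabulary, which the abstract does not specify.

```python
from typing import List, Tuple

# Hypothetical event format: (track, onset_step, pitch).
Event = Tuple[int, int, int]


def interleaved(events: List[Event]) -> List[str]:
    """One time-ordered sequence in which events from all tracks are mixed."""
    ordered = sorted(events, key=lambda e: (e[1], e[0], e[2]))
    return [f"T{track}_ON{step}_P{pitch}" for track, step, pitch in ordered]


def track_concatenated(events: List[Event]) -> List[str]:
    """A time-ordered sequence per track, with the tracks joined end to end."""
    tokens: List[str] = []
    for track in sorted({e[0] for e in events}):
        tokens.append(f"TRACK_START_{track}")
        own = sorted((e for e in events if e[0] == track), key=lambda e: (e[1], e[2]))
        for _, step, pitch in own:
            tokens.append(f"ON{step}_P{pitch}")
        tokens.append("TRACK_END")
    return tokens


events = [(0, 0, 60), (1, 0, 36), (0, 2, 64), (1, 2, 43)]
mixed = interleaved(events)
per_track = track_concatenated(events)
```

In the concatenated layout each track occupies a contiguous span of tokens, which is one plausible reason such a layout suits track-level infilling: a whole span can be masked and regenerated without disturbing the tokens of other tracks.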


D3RM: A Discrete Denoising Diffusion Refinement Model for Piano Transcription

Kim, Hounsu, Kwon, Taegyun, Nam, Juhan

arXiv.org Artificial Intelligence

Diffusion models have been widely used in the generative domain due to their convincing performance in modeling complex data distributions. Moreover, they have shown competitive results on discriminative tasks, such as image segmentation. While diffusion models have also been explored for automatic music transcription, their performance has yet to reach a competitive level. In this paper, we focus on the refinement capabilities of discrete diffusion models and present a novel architecture for piano transcription. Our model utilizes Neighborhood Attention layers as the denoising module, gradually predicting the target high-resolution piano roll, conditioned on the finetuned features of a pretrained acoustic model. To further enhance refinement, we devise a novel strategy which applies distinct transition states during the training and inference stages of discrete diffusion models. Experiments on the MAESTRO dataset show that our approach outperforms previous diffusion-based piano transcription models and the baseline model in terms of F1 score. Our code is available at https://github.com/hanshounsu/d3rm.


MidiTok Visualizer: a tool for visualization and analysis of tokenized MIDI symbolic music

Wiszenko, Michał, Stefański, Kacper, Malesa, Piotr, Pokorzyński, Łukasz, Modrzejewski, Mateusz

arXiv.org Artificial Intelligence

Symbolic music research plays a crucial role in music-related machine learning, but MIDI data can be complex for those without musical expertise. To address this issue, we present MidiTok Visualizer, a web application designed to facilitate the exploration and visualization of various MIDI tokenization methods from the MidiTok Python package. MidiTok Visualizer offers numerous customizable parameters, enabling users to upload MIDI files to visualize tokenized data alongside an interactive piano roll.


Symbolic Music Generation with Fine-grained Interactive Textural Guidance

Zhu, Tingyu, Liu, Haoyu, Jiang, Zhimin, Zheng, Zeyu

arXiv.org Artificial Intelligence

The problem of symbolic music generation presents unique challenges due to the combination of limited data availability and the need for high precision in note pitch. To overcome these difficulties, we introduce Fine-grained Textural Guidance (FTG) within diffusion models to correct errors in the learned distributions. By incorporating FTG, the diffusion models improve the accuracy of music generation, which makes them well-suited for advanced tasks such as progressive music generation, improvisation and interactive music creation. We derive theoretical characterizations for both the challenges in symbolic music generation and the effect of the FTG approach. We provide numerical experiments and a demo page for interactive music generation with user input to showcase the effectiveness of our approach.


Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion

Huang, Yujia, Ghatare, Adishree, Liu, Yuanzhe, Hu, Ziniu, Zhang, Qinsheng, Sastry, Chandramouli S, Gururani, Siddharth, Oore, Sageev, Yue, Yisong

arXiv.org Artificial Intelligence

We study the problem of symbolic music generation (e.g., generating piano rolls), with a technical focus on non-differentiable rule guidance. Musical rules are often expressed in symbolic form on note characteristics, such as note density or chord progression, and many of them are non-differentiable, which poses a challenge when using them for guided diffusion. We propose Stochastic Control Guidance (SCG), a novel guidance method that only requires forward evaluation of rule functions and can work with pre-trained diffusion models in a plug-and-play way, thus achieving training-free guidance for non-differentiable rules for the first time. Additionally, we introduce a latent diffusion architecture for symbolic music generation with high time resolution, which can be composed with SCG in a plug-and-play fashion. Compared to standard strong baselines in symbolic music generation, this framework demonstrates marked advancements in music quality and rule-based controllability, outperforming current state-of-the-art generators in a variety of settings. For detailed demonstrations, code and model checkpoints, please visit our project website: https://scg-rule-guided-music.github.io/.
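To make the notion of a non-differentiable rule concrete, here is a minimal sketch of a note-density rule evaluated on binary piano rolls, together with a simple best-of-n selection. The selection step is only an illustration of using a rule through forward evaluation alone; it is not the SCG algorithm from the paper, and the function names and the onset-count definition of density are assumptions.

```python
from typing import List

PianoRoll = List[List[int]]  # roll[pitch][step], binary


def count_onsets(roll: PianoRoll) -> int:
    """Count note onsets: cells equal to 1 that are not preceded by a 1.

    The result is an integer count, so it provides no useful gradient
    with respect to the roll; it can only be evaluated forward.
    """
    onsets = 0
    for row in roll:
        previous = 0
        for cell in row:
            if cell == 1 and previous == 0:
                onsets += 1
            previous = cell
    return onsets


def select_by_rule(candidates: List[PianoRoll], target_onsets: int) -> PianoRoll:
    """Pick the candidate whose onset count is closest to the target."""
    return min(candidates, key=lambda roll: abs(count_onsets(roll) - target_onsets))


sparse = [[1, 1, 0, 0], [0, 0, 0, 0]]  # one sustained note
dense = [[1, 0, 1, 0], [0, 1, 0, 1]]   # four short notes
chosen = select_by_rule([sparse, dense], target_onsets=3)
```

A guidance method built on forward evaluation would apply this kind of scoring to candidate samples during generation rather than once at the end, but the details of how the paper does so are not given in the abstract.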


Exploring Latent Spaces of Tonal Music using Variational Autoencoders

Carvalho, Nádia, Bernardes, Gilberto

arXiv.org Artificial Intelligence

Variational Autoencoders (VAEs) have proven to be effective models for producing latent representations of cognitive and semantic value. We assess the degree to which VAEs trained on a prototypical tonal music corpus of 371 Bach chorales define latent spaces representative of the circle of fifths and the hierarchical relation of each key component pitch as drawn in music cognition. In detail, we compare the latent space of different VAE corpus encodings -- Piano roll, MIDI, ABC, Tonnetz, DFT of pitch, and pitch class distributions -- in providing a pitch space for key relations that align with cognitive distances. We evaluate the model performance of these encodings using objective metrics to capture accuracy, mean square error (MSE), KL-divergence, and computational cost. The ABC encoding performs the best in reconstructing the original data, while the Pitch DFT seems to capture more information from the latent space. Furthermore, an objective evaluation of 12 major or minor transpositions per piece is adopted to quantify the alignment of 1) intra- and inter-segment distances per key and 2) the key distances to cognitive pitch spaces. Our results show that Pitch DFT VAE latent spaces align best with cognitive spaces and provide a common-tone space where overlapping objects within a key are fuzzy clusters, which impose a well-defined order of structural significance or stability -- i.e., a tonal hierarchy. Tonal hierarchies of different keys can be used to measure key distances and the relationships of their in-key components at multiple hierarchies (e.g., notes and chords). The implementation of our VAE and the encodings framework is made available online.


Content-based Controls For Music Large Language Modeling

Lin, Liwei, Xia, Gus, Jiang, Junyan, Zhang, Yixiao

arXiv.org Artificial Intelligence

Recent years have witnessed a rapid growth of large-scale language models in the domain of music audio. Such models enable end-to-end generation of higher-quality music, and some allow conditioned generation using text descriptions. However, the control power of text controls on music is intrinsically limited, as they can only describe music indirectly through meta-data (such as singers and instruments) or high-level representations (such as genre and emotion). We aim to further equip the models with direct and content-based controls on innate music languages such as pitch, chords and drum track. To this end, we contribute Coco-Mulla, a content-based control method for music large language modeling. It uses a parameter-efficient fine-tuning (PEFT) method tailored for Transformer-based audio models. Experiments show that our approach achieves high-quality music generation with low-resource semi-supervised learning, tuning fewer than 4% of the original model's parameters and training on a small dataset with fewer than 300 songs. Moreover, our approach enables effective content-based controls, and we illustrate the control power via chords and rhythms, two of the most salient features of music audio. Furthermore, we show that by combining content-based controls and text descriptions, our system achieves flexible music variation generation and style transfer. Our source code and demos are available online.


Generating symbolic music using diffusion models

Atassi, Lilac

arXiv.org Artificial Intelligence

Denoising Diffusion Probabilistic models have emerged as simple yet very powerful generative models. Unlike other generative models, diffusion models do not suffer from mode collapse or require a discriminator to generate high-quality samples. In this paper, a diffusion model that uses a binomial prior distribution to generate piano rolls is proposed. The paper also proposes an efficient method to train the model and generate samples. The generated music has coherence at time scales up to the length of the training piano roll segments. The paper demonstrates how this model is conditioned on the input and can be used to harmonize a given melody, complete an incomplete piano roll, or generate a variation of a given piece. The code is publicly shared to encourage the use and development of the method by the community.